Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for Phi-3.5-vision-instruct #1094

Merged
merged 17 commits into from
Dec 15, 2024
Merged

Add support for Phi-3.5-vision-instruct #1094

merged 17 commits into from
Dec 15, 2024

Conversation

xenova
Copy link
Collaborator

@xenova xenova commented Dec 11, 2024

Example: Single-frame (critique an image)

import {
  AutoProcessor,
  AutoModelForCausalLM,
  TextStreamer,
  load_image,
} from "@huggingface/transformers";

// Load processor and model
const model_id = "onnx-community/Phi-3.5-vision-instruct";
const processor = await AutoProcessor.from_pretrained(model_id, {
  legacy: true, // Use legacy to match python version
});
const model = await AutoModelForCausalLM.from_pretrained(model_id, {
  dtype: {
    vision_encoder: "q4", // 'q4' or 'q4f16'
    prepare_inputs_embeds: "q4", // 'q4' or 'q4f16'
    model: "q4f16", // 'q4f16'
  },
});

// Load image
const image = await load_image("https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/meme.png");

// Prepare inputs
const messages = [
  { role: "user", content: "<|image_1|>What's funny about this image?" },
];
const prompt = processor.tokenizer.apply_chat_template(messages, {
  tokenize: false,
  add_generation_prompt: true,
});
const inputs = await processor(prompt, image, { num_crops: 4 });

// (Optional) Set up text streamer
const streamer = new TextStreamer(processor.tokenizer, {
  skip_prompt: true,
  skip_special_tokens: true,
});

// Generate response
const output = await model.generate({
  ...inputs,
  streamer,
  max_new_tokens: 256,
});

Or, decode the output at the end:

// Decode and display the answer
const generated_ids = output.slice(null, [inputs.input_ids.dims[1], null]);
const answer = processor.batch_decode(generated_ids, {
  skip_special_tokens: true,
});
console.log(answer[0]);
Example output

The humor in this image stems from the exaggerated depiction of human evolution, using the Shiba Inu dog breed to represent both ancient and modern humans. The left side shows a muscular, hunter-like figure labeled as 'Humans 100,000 years ago' with the caption 'me hungry me hunt mammoth,' suggesting a time when humans were physically robust and actively hunting. The right side contrasts this with a modern, slim Shiba Inu labeled as 'Humans today' with the caption 'why food delivery slow,' humorously commenting on the modern human's reliance on convenience and technology, such as food delivery services, rather than hunting for sustenance. The use of a dog, which is often associated with loyalty and companionship, adds a layer of irony and humor as it portrays humans in a more diminished, dependent state.


Example: Multi-frame (summarize slides)

import {
  AutoProcessor,
  AutoModelForCausalLM,
  TextStreamer,
  load_image,
} from "@huggingface/transformers";

// Load processor and model
const model_id = "onnx-community/Phi-3.5-vision-instruct";
const processor = await AutoProcessor.from_pretrained(model_id, {
  legacy: true, // Use legacy to match python version
});
const model = await AutoModelForCausalLM.from_pretrained(model_id, {
  dtype: {
    vision_encoder: "q4", // 'q4' or 'q4f16'
    prepare_inputs_embeds: "q4", // 'q4' or 'q4f16'
    model: "q4f16", // 'q4f16'
  },
});

// Load images
const urls = [
  "https://image.slidesharecdn.com/azureintroduction-191206101932/75/Introduction-to-Microsoft-Azure-Cloud-1-2048.jpg",
  "https://image.slidesharecdn.com/azureintroduction-191206101932/75/Introduction-to-Microsoft-Azure-Cloud-2-2048.jpg",
  "https://image.slidesharecdn.com/azureintroduction-191206101932/75/Introduction-to-Microsoft-Azure-Cloud-3-2048.jpg",
];
const images = await Promise.all(urls.map(load_image));

// Prepare inputs
const placeholder = images.map((_, i) => `<|image_${i + 1}|>\n`).join("");
const messages = [
  { role: "user", content: placeholder + "Summarize the deck of slides." },
];
const prompt = processor.tokenizer.apply_chat_template(messages, {
  tokenize: false,
  add_generation_prompt: true,
});
const inputs = await processor(prompt, images, { num_crops: 4 });

// (Optional) Set up text streamer
const streamer = new TextStreamer(processor.tokenizer, {
  skip_prompt: true,
  skip_special_tokens: true,
});

// Generate response
const output = await model.generate({
  ...inputs,
  streamer,
  max_new_tokens: 256,
});
Example output

To summarize, the slides are composed of these sections:

  • Introduction to Azure:
    The presentation introduces Microsoft Azure, a cloud computing platform. It highlights Azure's three service tiers: Hyper-scale, Enterprise, and Hybrid. The presenter is Dinesh Kumar Wickramasinghe, a Senior Software Engineer from CMS Private Limited in Sri Lanka.

  • Azure Overview:
    Azure is described as Microsoft's cloud computing platform, continuously expanding to meet current and future business challenges. It offers freedom to build, manage, and deploy applications on a global network using preferred tools and frameworks.

  • Cloud Computing Services:
    The presentation outlines three types of cloud computing services provided by Azure: Infrastructure-as-a-Service (IaaS) with a 'host' component, Platform-as-a-Service (PaaS) with a 'build' component, and Software-as-a-Service (SaaS) with a 'consume' component.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@xenova xenova merged commit 5334e7e into main Dec 15, 2024
4 checks passed
@xenova xenova deleted the add-phi3_v branch December 15, 2024 13:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants